What is a Search Engine?

A search engine is a software system that helps users find relevant information from a large collection of data by processing queries and returning matching results.

It typically works by indexing content (such as web pages or documents), allowing users to perform keyword-based searches, and ranking the results based on relevance.

Popular real-world examples include Google, Bing, and Elasticsearch.

In this chapter, we will explore the low-level design of a simple in-memory search engine.

Let's start by clarifying the requirements:

1. Clarifying Requirements

Before starting the design, it's important to ask thoughtful questions to uncover hidden assumptions, clarify ambiguities, and define the system's scope more precisely.

Here is an example of how a conversation between the candidate and the interviewer might unfold:

Discussion

Candidate: Implementing web crawling can add significant complexity. Should we preload documents or web pages into the system?

Interviewer: For this version, assume a predefined set of documents or web pages is already available in memory. No need to implement crawling.

Candidate: Should the search engine support only keyword-based search, or should it also handle phrase queries and logical operators?

Interviewer: Keep it simple for now. Basic keyword-based search is sufficient.

Candidate: Should the system return only exact matches, or also support partial and fuzzy matches?

Interviewer: Let's support exact matches only for now. You can assume case-insensitive search.

Candidate: Do we need to rank the results, or is returning any matching document enough?

Interviewer: Basic scoring and ranking should be implemented (e.g., based on the frequency of the keyword within each document).

Candidate: Should we include text processing techniques like stop-word removal or stemming during indexing and querying?

Interviewer: Not for this version. Treat all words equally. No stop-word removal or stemming.

Candidate: Should we allow users to input search queries dynamically, or can we hardcode a set of search queries?

Interviewer: For this version, assume queries are predefined and supplied via code. No need to handle runtime user input.

After gathering the details, we can summarize the key system requirements.

1.1 Functional Requirements

Index a predefined set of documents available in memory.
Support case-insensitive, keyword-based search. Return a list of documents containing the specified keyword.
Support basic ranking of search results (e.g., using keyword frequency within each document).
Provide a simple interface to input queries and display search results.

1.2 Non-Functional Requirements

Modularity: The system should follow clean object-oriented design with well-separated responsibilities.
Performance: Search queries should return results quickly, even when handling large sets of documents.
Maintainability: The code should be easy to test, debug, and evolve over time.
Memory Efficiency: The indexing structure should be memory-optimized to store and search documents efficiently.

2. Identifying Core Entities

Core entities are the fundamental building blocks of our system. We identify them by analyzing the functional requirements and mapping the key responsibilities to object-oriented abstractions: classes, interfaces, or enums.

Let's walk through the functional requirements and extract the relevant entities:

1. The system should index a predefined set of documents in memory.

This indicates the need for a Document entity to represent each searchable item. Each document should have a unique identifier and raw text content.

To manage all available documents, we introduce a DocumentStore entity. This serves as an in-memory container that exposes APIs to add and retrieve documents.

For efficient keyword-based retrieval, we require an InvertedIndex, a well-known data structure in search engines. It maps terms (keywords) to the list of documents that contain them, along with metadata such as frequency.

2. The system should return matching documents ranked by keyword frequency.

To support this, we define a Posting entity that represents an occurrence of a term in a document. Each posting includes:

Document ID
Term frequency (i.e., how many times the term appears in the document)

In addition, we introduce a SearchResult entity that packages:

A matched Document
A relevance score (e.g., based on term frequency)

3. The system should process queries and return ranked results.

To orchestrate the entire search pipeline, we define a SearchEngine entity.

It builds the inverted index using the document store.
It accepts queries, applies scoring and ranking strategies, and returns results.

Summary of Core Entities

Document: Represents a searchable item with fields like ID, title, and content.
DocumentStore: Maintains all documents in memory and provides retrieval methods.
InvertedIndex: Core data structure mapping terms to their document postings.
Posting: Represents the occurrence of a term in a document (document ID, frequency, etc.).
SearchResult: Represents a matched document along with metadata and relevance score.
SearchEngine: Coordinates the search process, from indexing and query parsing to retrieval.

These core entities define the essential abstractions of the in-memory search engine and will guide the structure of your low-level design and class diagrams.

3. Designing Classes and Relationships

This section outlines the classes that form the building blocks of the search engine, their responsibilities, and the relationships between them.

3.1 Class Definitions

The system is designed with a clear separation of concerns, categorized into data classes that hold information and core classes that implement the engine's logic.

Data Classes

These are simple Plain Old Java Objects (POJOs) or data containers with minimal logic.

`Document`

Represents a single unit of information to be indexed and searched. It contains a unique id, a title, and its content.

`Posting`

An entry in the inverted index.

It encapsulates the documentId where a term appears and the frequency of that term within the document.

`SearchResult`

A container that pairs a Document with its calculated relevance score, used for ranking and display.
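In Python, containers like these are often written with the standard-library `dataclasses` module. The sketch below is one possible shape, not the implementation used later in this chapter (which uses explicit getter methods). The field names follow the descriptions above, and `frozen=True` is an assumption made here so that instances are immutable once created.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    id: str
    title: str
    content: str


@dataclass(frozen=True)
class Posting:
    document_id: str  # the document where the term appears
    frequency: int    # how many times the term occurs in that document


@dataclass(frozen=True)
class SearchResult:
    document: Document
    score: float


# Sample values made up for illustration.
doc = Document("doc1", "Java Performance", "Java is fast.")
posting = Posting(document_id=doc.id, frequency=1)
result = SearchResult(document=doc, score=float(posting.frequency))
print(result.document.title, result.score)  # Java Performance 1.0
```

A dataclass generates `__init__`, `__repr__`, and `__eq__` automatically, which keeps data-only classes short.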

Core Classes

These classes contain the main business logic for indexing, searching, scoring, and ranking.

`DocumentStore`

Acts as an in-memory database, mapping document IDs to Document objects for quick retrieval.

`InvertedIndex`

The central data structure of the engine.

It maps each term (word) to a list of Posting objects, enabling fast lookups of documents containing a specific term.

`ScoreThenAlphabeticalRankingStrategy`

A concrete implementation of RankingStrategy that sorts by score, using the document title alphabetically as a tie-breaker.

`SearchEngine`

The main orchestrator class.

It provides a simple public API for indexing documents and performing searches, hiding the underlying complexity of the system.

3.2 Class Relationships

The classes interact through a combination of composition, association, and dependency, creating a robust and flexible system.

Composition

The SearchEngine has a strong "owns-a" relationship with its core components.

SearchEngine ◆── InvertedIndex: The SearchEngine creates and manages the lifecycle of its InvertedIndex. The index cannot exist without the engine.
SearchEngine ◆── DocumentStore: Similarly, the DocumentStore is an integral part of the SearchEngine and is managed by it.

Aggregation

The index and store have "has-a" relationships with their data objects.

InvertedIndex ◇── Posting: The InvertedIndex contains a map of terms to lists of Posting objects. The postings are part of the index but represent data linked to independent documents.
DocumentStore ◇── Document: The DocumentStore holds a collection of Document objects, which are created externally and added to the store.

Association

The SearchEngine has a "uses-a" relationship with its strategies.

SearchEngine → ScoringStrategy: The SearchEngine holds a reference to a ScoringStrategy object. This allows the scoring algorithm to be changed dynamically (pluggable behavior).
SearchEngine → RankingStrategy: The SearchEngine also holds a reference to a RankingStrategy, allowing the ranking logic to be easily swapped.
SearchResult → Document: Each SearchResult is associated with the Document it represents.

Dependency

Several classes depend on others to perform their tasks, often as method parameters.

SearchEngine depends on Document for indexing and SearchResult for returning search results.
The ScoringStrategy implementations depend on Posting and Document to calculate a score.

3.3 Key Design Patterns

Several design patterns are employed to ensure the system is efficient, scalable, and maintainable.

Strategy Pattern

This pattern is used to make the scoring and ranking algorithms interchangeable.

`ScoringStrategy`

The ScoringStrategy interface and its concrete implementations (TermFrequencyScoringStrategy, TitleBoostScoringStrategy) allow the client to choose how documents are scored without modifying the SearchEngine.

`RankingStrategy`

Likewise, the RankingStrategy interface and its implementations allow the sorting logic for results to be defined and selected at runtime.

Facade Pattern

The SearchEngine class acts as a Facade. It provides a simplified, high-level interface (indexDocuments, search) to the more complex underlying subsystem of indexing, data storage, scoring, and ranking. This decouples the client from the internal workings of the search engine.

Singleton Pattern

The SearchEngine is implemented as a Singleton to ensure there is only one instance managing the index and document store for the entire application. This provides a single, global point of access and prevents inconsistencies from multiple competing instances.
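To make the single-instance guarantee concrete, here is a minimal sketch of one common way to write a Singleton in Python, by overriding `__new__`. The class name `Registry` and its `items` attribute are hypothetical and exist only for this illustration.

```python
class Registry:
    """Hypothetical class used only to illustrate the Singleton idea."""

    _instance = None

    def __new__(cls):
        # Create the object on the first call; later calls reuse it.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.items = []  # shared state, initialised once
        return cls._instance


a = Registry()
b = Registry()
a.items.append("doc1")

print(a is b)   # True: both names refer to the same object
print(b.items)  # ['doc1']: state added through `a` is visible through `b`
```

One trade-off to keep in mind: because the state is shared globally, anything indexed in one part of a program (or one test) remains visible everywhere else unless the instance is reset.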

3.4 Full Class Diagram

4. Implementation

4.1 Document

Represents a unit of information indexed by the search engine.

```python
class Document:
    """A single unit of information to be indexed and searched."""

    def __init__(self, id: str, title: str, content: str):
        self.id = id
        self.title = title
        self.content = content

    def get_id(self) -> str:
        return self.id

    def get_title(self) -> str:
        return self.title

    def get_content(self) -> str:
        return self.content

    def __str__(self) -> str:
        return f"Document(id={self.id}, title='{self.title}')"
```

Each document has:

A unique id
A title and content for search and scoring

4.2 DocumentStore

Acts as an in-memory database for documents.

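```python
from typing import Dict, Optional


class DocumentStore:
    """In-memory mapping from document ID to Document."""

    def __init__(self):
        self.store: Dict[str, Document] = {}

    def add_document(self, doc: Document):
        # A document with an existing ID replaces the previous entry.
        self.store[doc.get_id()] = doc

    def get_document(self, doc_id: str) -> Optional[Document]:
        # Returns None when the ID is unknown.
        return self.store.get(doc_id)
```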

Supports retrieval by ID during scoring and search result generation.

4.3 Posting

Encapsulates term-specific metadata within a document. Used as entries in the inverted index.

```python
class Posting:
    """One entry in the inverted index: a document and a term count."""

    def __init__(self, document_id: str, frequency: int):
        self.document_id = document_id
        self.frequency = frequency

    def get_document_id(self) -> str:
        return self.document_id

    def get_frequency(self) -> int:
        return self.frequency
```

document_id: The document where the term appears
frequency: How often the term occurs (used for scoring)

4.4 InvertedIndex

Maps each term to a list of Postings.

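```python
from collections import defaultdict
from typing import Dict, List


class InvertedIndex:
    """Maps each term to the list of postings for that term."""

    def __init__(self):
        self.index: Dict[str, List[Posting]] = defaultdict(list)

    def add(self, term: str, document_id: str, frequency: int):
        # defaultdict creates the empty list on first access.
        self.index[term].append(Posting(document_id, frequency))

    def get_postings(self, term: str) -> List[Posting]:
        # .get() avoids creating an entry for a term that was never indexed.
        return self.index.get(term, [])
```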

This is the heart of the search engine that enables fast lookup of documents containing a query term.

An inverted index is the fundamental data structure that makes search engines fast. Instead of scanning every document for a query term (which would be very slow), we pre-process the documents and build a map from each term (word) to a list of documents that contain it.
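As a small illustration of that mapping, the sketch below builds an inverted index over two toy documents using plain tuples instead of Posting objects. The document IDs and texts are made up for the example, and splitting on whitespace is a simplification.

```python
from collections import Counter, defaultdict

# Two toy documents (made up for illustration).
docs = {
    "d1": "java is fast and java is popular",
    "d2": "python is popular",
}

# term -> list of (document_id, frequency) pairs
index = defaultdict(list)
for doc_id, text in docs.items():
    for term, freq in Counter(text.split()).items():
        index[term].append((doc_id, freq))

print(index["java"])          # [('d1', 2)]
print(index["popular"])       # [('d1', 1), ('d2', 1)]
print(index.get("rust", []))  # []
```

Answering a query is now a single dictionary lookup rather than a scan over every document's text.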

4.5 SearchResult

Pairs a document with its calculated relevance score. Used for ranking and presenting the final search results to the user.

```python
class SearchResult:
    """Pairs a matched document with its relevance score."""

    def __init__(self, document: Document, score: float):
        self.document = document
        self.score = score

    def get_document(self) -> Document:
        return self.document

    def get_score(self) -> float:
        return self.score

    def __str__(self) -> str:
        return f"  - {self.document.get_title()} (Score: {self.score:.2f})"
```

4.6 Scoring Strategies

Implements the Strategy pattern for scoring.

```python
from abc import ABC, abstractmethod


class ScoringStrategy(ABC):
    @abstractmethod
    def calculate_score(self, term: str, posting: Posting, document: Document) -> float:
        pass


class TermFrequencyScoringStrategy(ScoringStrategy):
    def calculate_score(self, term: str, posting: Posting, document: Document) -> float:
        # The score is simply the number of occurrences of the term.
        return posting.get_frequency()


class TitleBoostScoringStrategy(ScoringStrategy):
    TITLE_BOOST_FACTOR = 1.5

    def calculate_score(self, term: str, posting: Posting, document: Document) -> float:
        score = posting.get_frequency()
        # Boost documents whose title contains the term (substring check).
        if term in document.get_title().lower():
            score *= self.TITLE_BOOST_FACTOR
        return score
```

The ScoringStrategy interface defines a contract for any scoring algorithm. The SearchEngine holds a reference to an object of this type. This allows us to easily switch between a simple TermFrequencyScoringStrategy and a more advanced TitleBoostScoringStrategy at runtime.
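A short worked example of the two scoring rules, written as plain functions so the snippet stays self-contained. The function names are hypothetical; the 1.5 factor mirrors `TITLE_BOOST_FACTOR` above, and the sample frequencies are illustrative.

```python
TITLE_BOOST_FACTOR = 1.5


def term_frequency_score(frequency: int) -> float:
    # Score equals the raw term count.
    return float(frequency)


def title_boost_score(term: str, frequency: int, title: str) -> float:
    score = float(frequency)
    if term in title.lower():  # substring check, as in the class above
        score *= TITLE_BOOST_FACTOR
    return score


# Suppose "java" occurs 3 times in a document titled "Java Performance".
print(term_frequency_score(3))                               # 3.0
print(title_boost_score("java", 3, "Java Performance"))      # 4.5
print(title_boost_score("language", 1, "Java Performance"))  # 1.0
```

Because the title check is a substring test, a title such as "JavaScript Basics" would also receive the boost for the term "java". Comparing against the title's tokens instead would avoid that.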

4.7 Ranking Strategies

Implements the Strategy pattern for ranking.

```python
from abc import ABC, abstractmethod
from typing import List


class RankingStrategy(ABC):
    @abstractmethod
    def rank(self, results: List[SearchResult]):
        pass


class ScoreBasedRankingStrategy(RankingStrategy):
    def rank(self, results: List[SearchResult]):
        # Sort in place, highest score first.
        results.sort(key=lambda x: x.get_score(), reverse=True)


class ScoreThenAlphabeticalRankingStrategy(RankingStrategy):
    def rank(self, results: List[SearchResult]):
        # Highest score first; ties are broken by title in ascending order.
        results.sort(key=lambda x: (-x.get_score(), x.get_document().get_title()))
```

Similar to scoring, the RankingStrategy allows us to define different ways to order the final results. The ScoreBasedRankingStrategy provides a standard relevance sort, while the ScoreThenAlphabeticalRankingStrategy shows how we can handle tie-breaking gracefully, a common requirement in real-world systems.
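The tie-breaking sort key can be seen in isolation with a few sample results. This sketch uses `(title, score)` tuples in place of SearchResult objects, and the scores are sample values chosen to include a tie.

```python
# (title, score) pairs standing in for SearchResult objects.
results = [
    ("Python vs. Java", 4.5),
    ("Advanced Java Concepts", 3.0),
    ("Java Performance", 4.5),
]

# Negating the score sorts it descending, while the title sorts ascending.
ranked = sorted(results, key=lambda r: (-r[1], r[0]))
for title, score in ranked:
    print(f"{title} ({score:.2f})")

# Output:
# Java Performance (4.50)
# Python vs. Java (4.50)
# Advanced Java Concepts (3.00)
```

With a score-only key, Python's stable sort would instead leave the two tied results in their original relative order.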

4.8 SearchEngine

This class acts as a central Facade and Singleton, orchestrating all the components to provide a simple API for indexing and searching.

```python
import re
from typing import Dict, List


class SearchEngine:
    _instance = None

    def __new__(cls):
        # Singleton: always hand back the same object.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self):
        # __init__ runs on every SearchEngine() call, so initialise only once.
        if hasattr(self, '_initialized'):
            return
        self._initialized = True
        self.inverted_index = InvertedIndex()
        self.document_store = DocumentStore()
        # Both strategies must be set before search() is called.
        self.scoring_strategy = None
        self.ranking_strategy = None

    @classmethod
    def get_instance(cls):
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def set_scoring_strategy(self, scoring_strategy: ScoringStrategy):
        self.scoring_strategy = scoring_strategy

    def set_ranking_strategy(self, ranking_strategy: RankingStrategy):
        self.ranking_strategy = ranking_strategy

    def index_documents(self, documents: List[Document]):
        for doc in documents:
            self.index_document(doc)

    def index_document(self, doc: Document):
        self.document_store.add_document(doc)

        # Tokenize the title and content, then count each term.
        term_frequencies: Dict[str, int] = {}
        text = (doc.get_title() + " " + doc.get_content()).lower()
        tokens = re.split(r'\W+', text)
        for token in tokens:
            if token:  # re.split can yield empty strings at the edges
                term_frequencies[token] = term_frequencies.get(token, 0) + 1

        for term, frequency in term_frequencies.items():
            self.inverted_index.add(term, doc.get_id(), frequency)

    def search(self, query: str) -> List[SearchResult]:
        # The whole query is treated as a single case-insensitive keyword.
        processed_query = query.lower()
        postings = self.inverted_index.get_postings(processed_query)

        results = []
        for posting in postings:
            doc = self.document_store.get_document(posting.get_document_id())
            if doc is not None:
                score = self.scoring_strategy.calculate_score(processed_query, posting, doc)
                results.append(SearchResult(doc, score))

        self.ranking_strategy.rank(results)
        return results
```

Singleton Pattern: The engine is a Singleton to ensure there is only one instance managing the index and document store for the entire application.
Facade Pattern: It provides a simple, high-level API (index_documents, search) that hides the underlying complexity of tokenization, index management, scoring, and ranking.
Indexing Process: The index_document method demonstrates the full pipeline: tokenizing text, counting term frequencies, and populating the InvertedIndex.
Search Process: The search method orchestrates the retrieval: it gets candidate documents from the index, uses the injected ScoringStrategy to calculate their relevance, and then uses the injected RankingStrategy to sort them before returning the final list.
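The tokenize-and-count step of the indexing process can be tried on its own. The sketch below applies the same `re.split(r"\W+", ...)` approach to one sample document. The helper name `term_frequencies` is hypothetical, and `collections.Counter` is used here as a shorthand for the manual dictionary counting in the class above.

```python
import re
from collections import Counter


def term_frequencies(title: str, content: str) -> Counter:
    """Lowercase, split on non-word characters, and count the tokens."""
    text = f"{title} {content}".lower()
    tokens = [t for t in re.split(r"\W+", text) if t]  # drop empty strings
    return Counter(tokens)


freqs = term_frequencies(
    "Java Performance",
    "Java is a high-performance language. Tuning Java applications is key.",
)
print(freqs["java"])         # 3
print(freqs["performance"])  # 2 ("high-performance" splits into two tokens)
print(freqs["python"])       # 0 (Counter returns 0 for missing keys)
```

Note that the title is tokenized together with the content, so a term that appears in the title counts toward its frequency.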

4.9 SearchEngineDemo

This driver class shows how a client would interact with the SearchEngine and demonstrates the flexibility of the strategy-based design.

```python
class SearchEngineDemo:
    @staticmethod
    def main():
        engine = SearchEngine.get_instance()

        documents = [
            Document("doc1", "Java Performance", "Java is a high-performance language. Tuning Java applications is key."),
            Document("doc2", "Introduction to Python", "Python is a versatile language, great for beginners."),
            Document("doc3", "Advanced Java Concepts", "This document covers advanced topics in Java programming."),
            Document("doc4", "Python vs. Java", "A document comparing Python and Java for web development. Java is faster.")
        ]

        print("Indexing documents...")
        engine.index_documents(documents)
        print("Indexing complete.\n")

        # First configuration: raw term frequency, sorted by score only.
        print("====== TermFrequency Scoring + ScoreBased Ranking ======")
        engine.set_scoring_strategy(TermFrequencyScoringStrategy())
        engine.set_ranking_strategy(ScoreBasedRankingStrategy())
        SearchEngineDemo.perform_search(engine, "java")
        SearchEngineDemo.perform_search(engine, "language")
        SearchEngineDemo.perform_search(engine, "performance")

        # Second configuration: swap both strategies without re-indexing.
        print("\n====== TitleBoost Scoring + Score-then-Alphabetical Ranking ======")
        engine.set_scoring_strategy(TitleBoostScoringStrategy())
        engine.set_ranking_strategy(ScoreThenAlphabeticalRankingStrategy())
        SearchEngineDemo.perform_search(engine, "java")
        SearchEngineDemo.perform_search(engine, "language")
        SearchEngineDemo.perform_search(engine, "performance")
        SearchEngineDemo.perform_search(engine, "paint")  # no document contains this term

    @staticmethod
    def perform_search(engine: SearchEngine, query: str):
        print(f"--- Searching for: '{query}' ---")
        results = engine.search(query)
        if not results:
            print("  No results found.")
        else:
            for i, result in enumerate(results):
                print(f"Rank {i + 1}:{result}")
        print()


if __name__ == "__main__":
    SearchEngineDemo.main()
```

5. Run and Test
